Abstract —Cloud mobile computing enables the offloading of 
computation-intensive applications from a mobile device to a 
cloud processor via a wireless interface. In light of the strong 
interplay between offloading decisions at the application layer 
and physical-layer parameters, which determine the energy and 
latency associated with the mobile-cloud communication, this 
paper investigates the inter-layer optimization of fine-grained 
task offloading across both layers. In prior art, this problem 
was formulated, under a serial implementation of processing 
and communication, as a mixed integer program, entailing a 
complexity that is exponential in the number of tasks. In this 
work, instead, algorithmic solutions are proposed that leverage 
the structure of the call graphs of typical applications by means 
of message passing on the call graph, under both serial and 
parallel implementations of processing and communication. For 
call trees, the proposed solutions have a linear complexity in 
the number of tasks, and efficient extensions are presented for 
more general call graphs that include “map” and “reduce”-type 
tasks. Moreover, the proposed schemes are optimal for the serial 
implementation, and provide principled heuristics for the parallel 
implementation. Extensive numerical results yield insights into 
the impact of inter-layer optimization and into the comparison of 
the two implementations. 

Index Terms —Cloud mobile computing, message passing, 
inter-layer optimization, dynamic programming. 

I. Introduction 

With the current widespread use of smart phones, there 
is an increasing demand on the users’ part for applications 
that require heavy computations to be run on battery-powered 
mobile devices, such as video processing, gaming, automatic 
translation, object recognition and medical monitoring. Offloading 
energy-consuming tasks from a mobile device to a 
cloud server, known in the literature as cyber foraging, 
computation offloading and, more commonly, cloud mobile 
computing, provides a viable solution to this problem, as 
attested to by systems such as Google Voice Search, Apple 
Siri and Shazam and by implementations such as MAUI 
and ThinkAir. 

A mobile application can be partitioned into its component 
tasks via profiling, producing a call graph for the program. 
The call graph describes the functional dependence between 
the different tasks (see Fig. 1 for an example). Offloading can 
either take place at the coarser granularity of entire applications 
or at the finer scale of individual tasks. In the latter case, each 
task may be either offloaded to the cloud or performed locally. 
Moreover, processing and communication processes can either be 
implemented one after another in a serial fashion, as assumed in 
most prior art, or may be parallelized in the case of 
non-conflicting tasks, as in [7], [8]. 

Fig. 1. An example of a call graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. 

State of the Art: The large majority of prior works on the 
subject of optimal fine-grained offloading tackle the problem 
on a per-mobile basis and assume a fixed physical layer, 
which provides a given information rate and latency. Examples 
of this approach for the serial implementation include 
a graph partitioning formulation; a heuristic on-line approach 
to task partitioning that aims to improve latency [10]; and 
adaptive solutions for time-varying channels based on 
Lyapunov optimization and on a constrained shortest path 
problem. For the parallel implementation, instead, 
references [7], [8] propose a dynamic programming solution, 
again with a fixed physical layer. 

While the assumption of a fixed physical layer made in 
all reviewed works simplifies the problem formulation, there 
is an evident interplay between decisions at the physical 
layer and offloading decisions at the application layer. Most 
fundamentally, the choice of the physical layer mode, e.g., 
of the transmission power and information rate, determines 
the mobile energy consumption, as well as the corresponding 
latency, for mobile-cloud communication. Therefore, a proper 
adaptation of the physical layer is instrumental in making 
cloud mobile computing viable. 

Recognizing this critical interplay, more recent work has 
tackled the inter-layer optimization of the physical and 
application layers. Specifically, a first line of work studied 
this problem for a general network of interfering mobile 
devices by assuming coarse-grained offloading. Fine-grained 
offloading is instead studied in [15], where the authors focus 
on a per-mobile formulation under a serial implementation. To 
reduce the complexity of the resulting mixed integer program, 
[15] proposes a method that limits the exponential 
number of alternative offloading decisions based on feasibility 
arguments. Furthermore, for fixed offloading decisions, the 
problem is shown to have useful convexity properties. A 
similar problem formulation is also studied in [16]. 

Main Contributions: In this paper, we investigate the 
per-mobile inter-layer fine-grained optimization of offloading 
decisions at the application layer and of the transmission powers 
at the physical layer, with the aim of minimizing energy 
and latency for both serial and parallel implementations. As 
discussed, prior works, including [15], [16], formulate the 
problem as a mixed integer program, whose complexity is 
exponential in the size of the call graph. Here, instead, we 
start from the observation that most call graphs have specific 
structures that can be leveraged to reduce the computational 
complexity. For instance, Fig. 1 shows a typical example of an 
application that is composed of “map” tasks, which perform 
operations such as filtering, feature extraction or sorting, and 
allow the successive tasks to be decomposed into independent 
operations (see tasks $T_2$, $T_3$, $T_4$); along with “reduce” tasks, 
which perform summary operations such as classification or 
regression (see tasks $T_{10}$, $T_{11}$ and $T_{14}$). This paper shows 
that, for structured graphs, solutions based on message passing 
can be developed for both the standard serial implementation 
(see Sec. IV) and the parallel implementation (see Sec. V). 

In particular, for applications with a tree structure, such as 
the subtrees $\mathcal{T}_1$ and $\mathcal{T}_2$ in Fig. 1, we develop an optimal and 
efficient message passing algorithm for the serial implementation, 
whose complexity is of the order $O(|\mathcal{V}| d_{\mathrm{in}})$, where $|\mathcal{V}|$ is the 
number of nodes of the call graph and $d_{\mathrm{in}}$ is the maximum 
in-degree. For the more challenging parallel implementation, 
the proposed method yields a principled suboptimal scheme 
whose complexity is of the same order as for the serial case. 
The performance of this scheme is evaluated by means of a 
dynamic model also introduced here. For more general call 
graphs, such as the one in Fig. 1, we generalize the proposed 
solutions to yield a complexity of the order $O(2^{|\mathcal{V}_s|} |\mathcal{V}| d_{\mathrm{in}})$, 
where $|\mathcal{V}_s|$ is the number of nodes that, if removed, decompose 
the graph into subtrees (such as $T_2$, $T_3$ and $T_4$ in Fig. 1, so 
that $|\mathcal{V}_s| = 3$ for this call graph). With reference to prior 
work, we note that the proposed approach for the parallel case 
generalizes the schemes in [7] and [8] by also encompassing 
the optimization of the physical layer. Extensive simulation 
results, presented in Sec. VI, bring insight into the impact of 
inter-layer optimization and of the call graph structure on the 
performance of cloud mobile computing. 


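Notation: Throughout, we use the graph terminology of, 
e.g., [17]. Accordingly, for a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, a node $a$ with 
an incoming edge from another node $b$ is referred to as a child 
of the parent node $b$. $\mathcal{P}(n)$ and $\mathcal{C}(n)$ are the sets containing 
parents and children, respectively, of a node $n \in \mathcal{V}$. Given 
a set $\mathcal{V}_1 \subseteq \mathbb{N}$, where $\mathbb{N}$ is the set of integers, and variables 
$X_i$ with $i \in \mathbb{N}$, $X_{\mathcal{V}_1}$ is the set defined as $X_{\mathcal{V}_1} = \{X_i : i \in \mathcal{V}_1\}$; 
similarly, for variables $X_{i,j}$ with $j \in \mathbb{N}$, $X_{\mathcal{V}_1,j}$ is the set defined 
as $X_{\mathcal{V}_1,j} = \{X_{i,j} : i \in \mathcal{V}_1\}$. 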

II. System Model 

We consider a per-mobile problem formulation in which 
a mobile aims at running a given application with minimal 
energy expenditure and latency. For this purpose, the mobile 
may offload some of the computing tasks to a cloud processor, 
also referred to as server. We consider a configuration with a 
single processor both at mobile and cloud. We start in this 
section by introducing the key quantities at the application 
layer and then at the physical layer. 


A. Application Layer 

A computer application can be described by its call graph. 
A call graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is a directed acyclic graph which 
is used to represent the causal relation among the tasks into 
which a program can be partitioned. An example is shown in 
Fig. 1. Each vertex, or node, in $\mathcal{V}$ represents a particular task 
to be carried out within the application, e.g., data preparation, 
edge recognition or transform coding. We denote the task 
nodes as $\mathcal{V} = \{T_1, \ldots, T_{|\mathcal{V}|}\}$. However, we will also use the 
shortcut notation $n \in \mathcal{V}$ in lieu of $T_n \in \mathcal{V}$, where no confusion 
can arise. In the call graph $\mathcal{G}$, a directed edge $(T_m, T_n) \in \mathcal{E}$ 
with $T_m \in \mathcal{V}$ and $T_n \in \mathcal{V}$ denotes the invocation of a “child” 
task $T_n$ by a “parent” task $T_m$. 

Each task node $T_n$ is characterized by a parameter $v_n$, 
which is the number of CPU cycles required for task $T_n$ to 
be completed. Let us define as $f^l$ and $f^r$ the number of CPU 
cycles/sec that can be run at the mobile (i.e., locally) and the 
cloud (i.e., remotely), respectively. The latency $L_n^l = v_n / f^l$ 
is then the time required to compute task $T_n$ locally and 
$L_n^r = v_n / f^r$ is the latency to run that task remotely in the case 
the respective processors are devoted only to the completion 
of task $T_n$. Each edge $(T_m, T_n) \in \mathcal{E}$ is instead labeled by 
the number of bits $b_{m,n}$ that must be transferred by the parent 
task $T_m$ in order to allow the computation of the child task 
$T_n$. 

To complete the description of the quantities of interest 
at the application layer, we introduce the offloading decision 
variables. Specifically, we define $I_n \in \{0,1\}$ as the indicator 
variable that determines whether task $T_n$ should be executed 
locally or remotely, where $I_n = 0$ indicates the local execution 
of the task and $I_n = 1$ represents the offloading of the 
task to the remote server. Not all the tasks may be eligible 
for offloading. In particular, a mobile application typically 
operates on input data, e.g., images or videos, that reside in 
the mobile device. This can be accounted for by identifying 
a subset $\mathcal{V}_d \subseteq \mathcal{V}$ of task nodes that represent input data 
preparation processes, such that for every task $T_m \in \mathcal{V}_d$ 
we have $I_m = 0$, i.e., local processing. These nodes are 
assumed to have no parents and have the role of initializing 
the application. For instance, in Fig. 1, we 
may have $\mathcal{V}_d = \{T_1\}$. Moreover, for any graph, we assume, 
without loss of generality, that there is a final task to be 
carried out at the mobile that has no children and completes 
the application by, e.g., showing the results on the mobile 
screen. An example is task $T_{15}$ in Fig. 1, for which we then 
have $I_{15} = 0$. 

B. Physical Layer 

We now describe the parameters and the optimization 
variables relative to the physical layer. The parameter $P^l$ 
represents the local processing power of the mobile and $P^{\mathrm{tx}}$ 
is the power required to keep the mobile’s RF circuits active 
during both transmission and reception, while $P^{\mathrm{rx}}$ is the power 
needed to process the received baseband signal for decoding at 
the mobile. All powers are measured in Watts. The parameter 
$C^{\mathrm{dl}}$ (bits/s) is the downlink capacity available to transfer the 
information bits from the server to the mobile. Uplink and 
downlink are assumed to be operated over orthogonal spectral 
resources. 

The optimization variable $P^{\mathrm{ul}}_{m,n}$ is the uplink power used 
by the mobile to transfer the necessary $b_{m,n}$ bits in case a 
parent task $T_m$ is run locally ($I_m = 0$) and a child task 
$T_n$ is performed remotely ($I_n = 1$), for all $(T_m, T_n) \in \mathcal{E}$. 
Note that we allow the uplink transmit powers $P^{\mathrm{ul}}_{m,n}$ to be 
different for every edge in $\mathcal{E}$, hence enabling a more flexible 
joint optimization of application and physical layers, as in [15]. 
Given an uplink power $P$, we denote as 

$$C^{\mathrm{ul}}(P) = B \log_2\left(1 + \frac{\gamma P}{N_0 B}\right) \qquad (1)$$

the uplink rate (bits/s) between the mobile and the server, 
where $\gamma$ accounts for the channel gain between the mobile and 
the server, $B$ is the available bandwidth and $N_0$ (Watts/Hz) is 
the noise power spectral density. 
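
As a quick numerical illustration of the uplink rate in (1), the following minimal Python sketch evaluates the rate for a given transmit power. The function name `uplink_rate` and all numerical values are assumptions made for this example only; they are chosen so that the signal-to-noise ratio equals one.

```python
import math

def uplink_rate(p_watts, bandwidth_hz, channel_gain, noise_psd):
    """Uplink rate in bits/s for transmit power p_watts, following (1)."""
    snr = channel_gain * p_watts / (noise_psd * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)

# Hypothetical values (not taken from the paper), giving an SNR of 1.
B = 1e6        # bandwidth in Hz
gamma = 1e-6   # channel gain
N0 = 1e-12     # noise power spectral density in W/Hz
rate = uplink_rate(1.0, B, gamma, N0)
```

With these values the rate is approximately $B$ bits/s, since $\log_2(1 + 1) = 1$; increasing the power raises the rate only logarithmically, which is what couples the energy and latency of each upload.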

III. Problem Formulation 

In this work, we aim at optimizing the application layer 
variables $\mathbf{I} = \{I_n\}_{n \in \mathcal{V}}$, with $I_n = 0$ for $n \in \mathcal{V}_d$ and for the root 
node, and the physical layer variables $\mathbf{P} = \{P^{\mathrm{ul}}_{m,n}\}_{(m,n) \in \mathcal{E}}$. 
We consider separately serial and parallel implementations. 

A. Serial Implementation 

In this section, as in most prior work, we assume that at 
any time, only one operation, either computation or 
communication, may take place, either at the mobile or at the server. 
Therefore, the operations needed to run a given application 
are performed in a serial fashion one after another. Note that 
the order in which these operations are scheduled is arbitrary 
as long as it is consistent with the procedures encoded in 
the call graph. For instance, for the tree $\mathcal{T}_1$ in Fig. 1, if 
$I_5 = I_6 = I_{13} = 0$ and $I_{10} = 1$, tasks $T_5$ and $T_6$ can be first 
carried out in any order at the mobile; then, $b_{5,10}$ and $b_{6,10}$ bits 
are transferred in the uplink in any order; then, node $T_{10}$ is 
processed at the cloud; and finally $b_{10,13}$ bits are downloaded 
by the mobile, which performs task $T_{13}$. 

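Under a serial implementation, the overall latency is the 
sum of all the latencies required to communicate and compute 
across all task nodes, which can be written as (see also [15]) 

$$L(\mathbf{I}, \mathbf{P}) = \sum_{n=1}^{|\mathcal{V}|} L_n(I_n) + \sum_{n=1}^{|\mathcal{V}|} \sum_{m \in \mathcal{P}(n)} L^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n}) + \sum_{n=1}^{|\mathcal{V}|} \sum_{m \in \mathcal{P}(n)} L^{\mathrm{dl}}_{m,n}(I_{\{m,n\}}), \qquad (2)$$

where $L_n(I_n) = (1 - I_n) L_n^l + I_n L_n^r$ denotes the delay 
required to perform the computations associated with task 
$T_n$ either locally or remotely; $L^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n}) = I_n (1 - I_m) b_{m,n} / C^{\mathrm{ul}}(P^{\mathrm{ul}}_{m,n})$ 
accounts for the delay caused by the 
transfer of $b_{m,n}$ bits to the server if task $T_n$ is offloaded 
($I_n = 1$) but $T_m$ is not ($I_m = 0$); and $L^{\mathrm{dl}}_{m,n}(I_{\{m,n\}}) = (1 - I_n) I_m b_{m,n} / C^{\mathrm{dl}}$ 
represents the latency caused by the transfer 
of $b_{m,n}$ bits to the mobile if $T_m$ is offloaded ($I_m = 1$) and 
$T_n$ is run locally ($I_n = 0$). 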

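The energy spent by the mobile for given variables is 
similarly given as the sum (see also [15]) 

$$E(\mathbf{I}, \mathbf{P}) = \sum_{n=1}^{|\mathcal{V}|} E_n(I_n) + \sum_{n=1}^{|\mathcal{V}|} \sum_{m \in \mathcal{P}(n)} E^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n}) + \sum_{n=1}^{|\mathcal{V}|} \sum_{m \in \mathcal{P}(n)} E^{\mathrm{dl}}_{m,n}(I_{\{m,n\}}), \qquad (3)$$

where the term $E_n(I_n) = (1 - I_n) P^l L_n^l$ measures the 
energy consumed by the mobile to perform each task $T_n$ 
locally if $I_n = 0$; the term $E^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n}) = (P^{\mathrm{ul}}_{m,n} + P^{\mathrm{tx}}) L^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n})$ 
is the energy required, for a task $T_n$ 
with $I_n = 1$, to transfer information from all the parent tasks 
$m \in \mathcal{P}(n)$ that are performed locally, namely with $I_m = 0$; 
and finally $E^{\mathrm{dl}}_{m,n}(I_{\{m,n\}}) = (P^{\mathrm{tx}} + P^{\mathrm{rx}}) L^{\mathrm{dl}}_{m,n}(I_{\{m,n\}})$ is the 
energy consumed, for a task $T_n$ with $I_n = 0$, to transfer 
and decode the information in the downlink from parent tasks 
$m \in \mathcal{P}(n)$ with $I_m = 1$. 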
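
The serial latency and energy sums can be checked numerically on a small call graph. The sketch below is a minimal illustration, not the authors' implementation: the function `serial_cost`, the dictionary-based graph encoding, the parameter names (`P_l` for local processing power, `P_tx` for the RF circuit power, `P_rx` for the decoding power, `C_dl` for the downlink capacity) and all numerical values are assumptions made for this example. Uplink powers are assumed strictly positive so that the uplink rate is nonzero.

```python
import math

def serial_cost(parents, cycles, bits, I, P_ul, prm):
    """Return (latency, energy) of the serial implementation for decisions I."""
    def c_ul(p):
        # Uplink rate in bits/s for transmit power p.
        return prm["B"] * math.log2(1 + prm["gamma"] * p / (prm["N0"] * prm["B"]))

    latency = energy = 0.0
    for n, v in cycles.items():
        l_loc, l_rem = v / prm["f_l"], v / prm["f_r"]
        # Computation delay, and mobile energy if the task runs locally.
        latency += (1 - I[n]) * l_loc + I[n] * l_rem
        energy += (1 - I[n]) * prm["P_l"] * l_loc
        for m in parents.get(n, []):
            b = bits[(m, n)]
            # Upload only if parent is local and child remote; download if reversed.
            t_ul = I[n] * (1 - I[m]) * b / c_ul(P_ul[(m, n)])
            t_dl = (1 - I[n]) * I[m] * b / prm["C_dl"]
            latency += t_ul + t_dl
            energy += (P_ul[(m, n)] + prm["P_tx"]) * t_ul
            energy += (prm["P_tx"] + prm["P_rx"]) * t_dl
    return latency, energy

# Hypothetical three-task chain T1 -> T2 -> T3 with T2 offloaded.
prm = dict(B=1e6, gamma=1e-6, N0=1e-12, f_l=1e9, f_r=1e10,
           P_l=1.0, P_tx=0.5, P_rx=0.5, C_dl=2e6)
parents = {2: [1], 3: [2]}
cycles = {1: 1e9, 2: 1e10, 3: 1e9}
bits = {(1, 2): 1e6, (2, 3): 2e6}
I = {1: 0, 2: 1, 3: 0}
P_ul = {(1, 2): 1.0, (2, 3): 1.0}
latency, energy = serial_cost(parents, cycles, bits, I, P_ul, prm)
```

In this toy instance each of the five operations (two local computations, one remote computation, one upload and one download) takes about one second, so the latency is about 5 s, while the mobile spends about 4.5 J: 2 J on local computing, 1.5 J on the upload and 1 J on the download.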


B. Parallel Operation 

As an alternative to the serial operation discussed above, 
we now consider an implementation that can potentially 
reduce the latency by parallelizing computing and 
communication. This implementation was implicitly assumed in [7], [8], 
but without consideration for the optimization of the physical 
layer. According to this implementation, tasks are processed 
as soon as they receive the necessary information from their 
parents. It is then possible for uplink transmissions, downlink 
transmissions, local and remote computations to occur at the 
same time. 

As an example, consider the call tree $\mathcal{T}_2$ in Fig. 1 with 
$I_7 = I_8 = I_9 = I_{14} = 0$ and $I_{11} = I_{12} = 1$. An 
illustrative timeline is shown in Fig. 2, where $CP^l$ denotes 
local computing and $CP^r$ denotes remote computing; UL 
indicates that the task is uploading information bits in the 
uplink; and DL means that the task is receiving information 
from one or more of its parent task nodes in the downlink. 
It can be seen that, for instance, task $T_{11}$ can be processed 
remotely as soon as the information from tasks $T_7$ and $T_8$ 
has been received by the server, while the uplink 
transmission for task $T_9$ may be still ongoing. Observe that, 
whenever multiple concurrent uplink/downlink transfers take 
place at the same time, the uplink/downlink spectral resources 
have to be properly divided (e.g., for tasks $T_7$, $T_8$ and $T_9$ at 
time $t_1$). This requires an adequate allocation of the spectral 
resources, such as time-frequency resource blocks in LTE. An 
analogous discussion applies to the computational resources. 

Fig. 2. An example of a timeline for the parallel implementation of the call 
tree $\mathcal{T}_2$ in Fig. 1 with $I_7 = I_8 = I_9 = I_{14} = 0$ and $I_{11} = I_{12} = 1$. 

Assuming the feasibility of allocating communication and 
computation resources as discussed above, the Appendix 
details a dynamic model that enables the evaluation of the 
energy and latency of the parallel implementation for given 
physical- and application-layer variables $\mathbf{P}$ and $\mathbf{I}$. This 
framework will be used in Sec. VI to evaluate the performance of 
the parallel implementation using numerical results. However, 
the framework in the Appendix does not lend itself to the 
development of efficient optimization algorithms due to the 
complexity of accounting for the mentioned reallocation of 
the communication and computation resources. In Sec. V, we 
develop useful heuristics for this purpose. 


C. Problem Formulation 

In order to optimize physical and application layer 
variables, we consider two different standard approaches (see, e.g., 
[18]). In the first problem formulation, a weighted sum of 
energy and latency is minimized via the problem 

$$[\mathrm{P.1}] \quad \underset{\mathbf{I}, \mathbf{P}}{\text{minimize}} \;\; E(\mathbf{I}, \mathbf{P}) + \lambda L(\mathbf{I}, \mathbf{P}), \qquad (4)$$

where $\lambda$ is a non-negative constant that determines the 
trade-off between energy and latency and can be interpreted as a 
Lagrange multiplier. By varying $\lambda$, one can explore the 
trade-off between latency and energy. An alternative problem 
formulation is to minimize the energy with a latency 
constraint as 

$$[\mathrm{P.2}] \quad \underset{\mathbf{I}, \mathbf{P}}{\text{minimize}} \;\; E(\mathbf{I}, \mathbf{P}) \quad \text{subject to} \;\; L(\mathbf{I}, \mathbf{P}) \le L_{\max}, \qquad (5)$$

where $L_{\max}$ is the maximum allowed delay. Note that, in (4) 
and (5), the domains of variables $\mathbf{I}$ and $\mathbf{P}$ are implicit. As will 
be illustrated in the next sections, it is analytically convenient 
to tackle problem [P.1] for the serial implementation and 
problem [P.2] for the parallel implementation. 

Remark 1. References [7], [8] tackled problem [P.2] for the 
parallel implementation under the assumption that the call 
graph is a tree or a parallel/serial combination of trees, and 
assuming that the physical-layer parameters $\mathbf{P}$ are not subject 
to optimization. Moreover, the papers [7], [8] implicitly assume 
that parallel communication and computation do not entail a 
division of the available resources, hence bypassing the issue 
discussed above. Under these assumptions, it is shown that 
the problem can be efficiently, albeit approximately, solved 
via dynamic programming by quantizing the set of possible 
delays. Reference [15] studied instead problem [P.2] for the 
serial implementation. The solution given in [15] prescribes 
a properly pruned exhaustive search over the variables $\mathbf{I}$, 
and leverages the fact that, for a fixed $\mathbf{I}$, the problem of 
optimization over $\mathbf{P}$, upon a proper change of variables, is 
convex. 


IV. Optimal Task Offloading for Serial 
Processing 

In this section, we tackle problem [P.1] for serial processing. 
The key idea of the proposed approach is to leverage the 
factorization of the objective function in [P.1] in order to apply 
the min-sum message passing algorithm. We first detail the 
mentioned factorization in Sec. IV-A. Then, in Sec. IV-B, we 
discuss the proposed efficient optimal method based on 
min-sum message passing [17] for the special case of a call tree. 
Finally, in Sec. IV-C, we extend the proposed algorithm to call 
graphs with more general structure. 


A. Factorization of the Cost Function 

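The objective function for problem [P.1] can be factorized 
over the task nodes as follows: 

$$E(\mathbf{I}, \mathbf{P}) + \lambda L(\mathbf{I}, \mathbf{P}) = \sum_{n \in \mathcal{V}} \Phi_n\left(I_{\{n\} \cup \mathcal{P}(n)}, P^{\mathrm{ul}}_{\mathcal{P}(n),n}\right), \qquad (6)$$

where the factor $\Phi_n(I_{\{n\} \cup \mathcal{P}(n)}, P^{\mathrm{ul}}_{\mathcal{P}(n),n})$ accounts for the 
weighted sum of energy and latency associated with the 
local or the remote computation of node $T_n$ and with the 
transmissions in uplink and/or downlink related to the edges 
connecting the parents of node $T_n$ to node $T_n$. This function 
is given, from (2) and (3), as 

$$\Phi_n\left(I_{\{n\} \cup \mathcal{P}(n)}, P^{\mathrm{ul}}_{\mathcal{P}(n),n}\right) = (1 - I_n) P^l L_n^l + \lambda L_n(I_n) + \sum_{m \in \mathcal{P}(n)} \left[ E^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n}) + \lambda L^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n}) + E^{\mathrm{dl}}_{m,n}(I_{\{m,n\}}) + \lambda L^{\mathrm{dl}}_{m,n}(I_{\{m,n\}}) \right]. \qquad (7)$$

Fig. 3. The clique tree $\mathcal{T}_c$ corresponding to the call tree $\mathcal{T}_1$ in Fig. 1. 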


We now show that the optimization in [P.1] over the 
transmission powers $\mathbf{P}$ can be carried out analytically, yielding new 
factors that are independent of the powers. In fact, given that 
each power $P^{\mathrm{ul}}_{m,n}$ appears separately in the factors of (6), the 
optimization of all powers can be carried out independently. In 
particular, the optimum power $P^{\mathrm{ul}*}_{m,n}$ for edges $(m,n) \in \mathcal{E}$ 
is given by the solution of the problem 

$$P^{\mathrm{ul}*}_{m,n} = \arg\min_{P^{\mathrm{ul}}_{m,n} \ge 0} \; \frac{P^{\mathrm{ul}}_{m,n} + P^{\mathrm{tx}} + \lambda}{C^{\mathrm{ul}}(P^{\mathrm{ul}}_{m,n})}. \qquad (8)$$

As discussed in [15], the optimization problem in (8) 
becomes strictly convex with the change of variables $y_{m,n} = C^{\mathrm{ul}}(P^{\mathrm{ul}}_{m,n})$, 
and hence its unique solution can be easily found. 
Note that the optimum values $P^{\mathrm{ul}*}_{m,n}$ for $(m,n) \in \mathcal{E}$ are 
equal, since the objective in (8) does not depend on the edge. 
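
To make the per-edge power optimization concrete, the following sketch minimizes the ratio of (uplink power + RF circuit power + latency weight) to the uplink rate with a golden-section search. This is a swapped-in numerical technique, not the change-of-variables method referenced in the text; it relies on the ratio being unimodal in the power. The function names, the search interval and all numerical values are assumptions made for this example.

```python
import math

def weighted_cost_per_bit(p, p_tx, lam, bandwidth, gamma, n0):
    """Energy-plus-weighted-latency cost per uploaded bit at uplink power p."""
    rate = bandwidth * math.log2(1 + gamma * p / (n0 * bandwidth))
    return (p + p_tx + lam) / rate

def optimal_uplink_power(p_tx, lam, bandwidth, gamma, n0, p_max=100.0, iters=200):
    """Golden-section search for the minimizer on (0, p_max] (assumed search range)."""
    lo, hi = 1e-9, p_max
    ratio = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        fa = weighted_cost_per_bit(a, p_tx, lam, bandwidth, gamma, n0)
        fb = weighted_cost_per_bit(b, p_tx, lam, bandwidth, gamma, n0)
        if fa < fb:
            hi = b   # the minimizer lies in [lo, b]
        else:
            lo = a   # the minimizer lies in [a, hi]
    return 0.5 * (lo + hi)

# Hypothetical setting with unit normalized channel gain, where the stationarity
# condition (1 + p) ln(1 + p) = p + p_tx + lam is solved by p = 1.
c = 2 * math.log(2) - 1
p_star = optimal_uplink_power(p_tx=0.2, lam=c - 0.2,
                              bandwidth=1e6, gamma=1e-6, n0=1e-12)
```

Because the number of bits on the edge only scales the objective, the same power is returned for every edge, in line with the remark that the optimum values coincide.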

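Substituting the optimum powers from (8) into (7), the 
problem [P.1] can be rewritten as 

$$[\mathrm{P.1}] \quad \underset{\mathbf{I}}{\text{minimize}} \;\; \sum_{n \in \mathcal{V}} \tilde{\Phi}_n\left(I_{\{n\} \cup \mathcal{P}(n)}\right), \qquad (9)$$

where we have defined the factors 

$$\tilde{\Phi}_n\left(I_{\{n\} \cup \mathcal{P}(n)}\right) = \Phi_n\left(I_{\{n\} \cup \mathcal{P}(n)}, P^{\mathrm{ul}*}_{\mathcal{P}(n),n}\right). \qquad (10)$$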

B. Message Passing for a Call Tree 

For a given call tree $\mathcal{T}$, as for $\mathcal{T}_1$ and $\mathcal{T}_2$ in Fig. 1, the 
problem [P.1] in (9) can be solved exactly via the min-sum 
message passing algorithm with a complexity of the order 
$O(|\mathcal{V}| d_{\mathrm{in}})$, where $d_{\mathrm{in}}$ is the maximum in-degree in the call 
graph. We refer to [17] for an introduction to message passing 
algorithms. 

The algorithm operates on a clique tree $\mathcal{T}_c$ that is associated 
with the call tree $\mathcal{T}$. The clique tree $\mathcal{T}_c$ can be constructed 
from $\mathcal{T}$ as follows: (i) replace the directed edges in $\mathcal{T}$ with 
undirected ones; and (ii) substitute each task node $T_n$ in $\mathcal{T}$ 
with a node of $\mathcal{T}_c$, which we label as the $n$th cluster node. 
Each cluster node $n$ is assigned the factor $\tilde{\Phi}_n(I_{\{n\} \cup \mathcal{P}(n)})$ 
in (10). Each edge that connects clusters $n$ and $m$ is labeled 
with the offloading variable that appears in both clusters $n$ and $m$. 
An example of a call tree and its corresponding clique tree is 
illustrated in Fig. 3. 
Once the clique tree is constructed, the min-sum message 
passing algorithm can be directly obtained following the 
standard rules as detailed in [17, Ch. 10]. To elaborate, we 
define $\{E^l(n), E^r(n)\}$ as the message sent by the $n$th cluster 
node on the edge labeled by $I_n$ to its child cluster, where 
$E^l(n)$ is the value of the message corresponding to $I_n = 0$ 
(local processing) and $E^r(n)$ is the value of the message for 
$I_n = 1$ (remote processing). Note that the definition of the 
parents and children nodes follows that used for the call tree 
$\mathcal{T}$. The messages of the clusters that are not leaves can be 
calculated recursively as 

$$E^l(n) = \sum_{m \in \mathcal{P}(n)} \min\left\{ E^l(m) + \tilde{\Phi}_n(I_n = 0, I_m = 0), \; E^r(m) + \tilde{\Phi}_n(I_n = 0, I_m = 1) \right\}, \qquad (11)$$

and 

$$E^r(n) = \sum_{m \in \mathcal{P}(n)} \min\left\{ E^l(m) + \tilde{\Phi}_n(I_n = 1, I_m = 0), \; E^r(m) + \tilde{\Phi}_n(I_n = 1, I_m = 1) \right\}. \qquad (12)$$

In order to keep track of the optimal decision $\mathbf{I}$, for each cluster 
$n$ and parent cluster $m$, we also define the functions $I^l_m(n)$ 
and $I^r_m(n)$, where we have $I^l_m(n) = 0$ if the first argument in 
the min operation in (11) is smaller and $I^l_m(n) = 1$ otherwise; 
and $I^r_m(n)$ is defined analogously with respect to (12). 

TABLE I 
Message Passing Algorithm for the Serial Implementation 

1: Calculate the powers $P^{\mathrm{ul}*}_{m,n}$ for all $(m,n) \in \mathcal{E}$ using (8). 
2: Build the corresponding clique tree as explained in Sec. IV-B (see Fig. 3). 
3: for n = 1:|V| do 
     if n is a leaf cluster 
        $E^l(n) = 0$ 
        $E^r(n) = \infty$ 
     else 
        Update $E^l(n)$ and $E^r(n)$ by using (11) and (12) and calculate 
        $I^l_m(n)$ and $I^r_m(n)$ for all $m \in \mathcal{P}(n)$ as explained in Sec. IV-B 
4: Trace back the optimum decisions. 


As detailed in Table I, the messages are first sent by the 
leaf clusters, and then each cluster transmits its message 
$\{E^l(n), E^r(n)\}$ to its child cluster as soon as it has received 
the messages from all its parents. The optimum decisions are finally 
obtained via backtracking, starting from the root node $T_{|\mathcal{V}|}$, so 
that for any node $n$ and every parent $m \in \mathcal{P}(n)$, we set 
$I_m = I^l_m(n)$ if $I_n = 0$ and $I_m = I^r_m(n)$ otherwise. From 
(11) and (12), the complexity of the serial implementation is of 
order $O(|\mathcal{V}| d_{\mathrm{in}})$, since every node needs to sum at most $d_{\mathrm{in}}$ 
metrics, each of which only requires two sums and a binary 
comparison. 
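
The leaf-to-root message computation and the backtracking step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the cost of each factor is assumed to split into a per-node term `node_cost(n, i_n)`, added once per node, and a per-edge term `edge_cost(m, n, i_m, i_n)`, both already evaluated at the optimized powers; the function names, the `forced_local` set and the toy costs are hypothetical.

```python
import math

def min_sum_offloading(parents, root, node_cost, edge_cost, forced_local):
    """Min-sum message passing on a call tree; returns (min cost, decisions)."""
    msg, choice = {}, {}

    def upward(n):
        # msg[n][i] = minimum cost of the subtree rooted at n given I_n = i.
        for m in parents.get(n, []):
            upward(m)
        msg[n] = [0.0, 0.0]
        for i_n in (0, 1):
            if i_n == 1 and n in forced_local:
                msg[n][i_n] = math.inf   # offloading not allowed for this node
                continue
            total = node_cost(n, i_n)
            for m in parents.get(n, []):
                options = [msg[m][i_m] + edge_cost(m, n, i_m, i_n) for i_m in (0, 1)]
                best = 0 if options[0] <= options[1] else 1
                choice[(m, n, i_n)] = best   # remembered for backtracking
                total += options[best]
            msg[n][i_n] = total

    def backtrack(n, i_n, decisions):
        decisions[n] = i_n
        for m in parents.get(n, []):
            backtrack(m, choice[(m, n, i_n)], decisions)
        return decisions

    upward(root)
    i_root = 0 if msg[root][0] <= msg[root][1] else 1
    return msg[root][i_root], backtrack(root, i_root, {})

# Hypothetical tree: T1 -> T2 -> T4, T5 -> T3 -> T4, T4 -> T6 (root).
parents = {2: [1], 3: [5], 4: [2, 3], 6: [4]}
local = {1: 1.0, 2: 4.0, 3: 5.0, 4: 6.0, 5: 1.0, 6: 1.0}
transfer = {(1, 2): 2.0, (5, 3): 7.0, (2, 4): 1.0, (3, 4): 1.0, (4, 6): 1.5}
node_cost = lambda n, i: local[n] if i == 0 else 0.5
edge_cost = lambda m, n, i_m, i_n: transfer[(m, n)] if i_m != i_n else 0.0
cost, decisions = min_sum_offloading(parents, 6, node_cost, edge_cost, {1, 5, 6})
```

In this toy instance the data nodes $T_1$, $T_5$ and the root $T_6$ are forced to run locally, and the search offloads $T_2$ and $T_4$ while keeping $T_3$ local because its input is expensive to upload. Each node inspects every parent once with two candidate values, which mirrors the linear complexity stated above.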


C. Message Passing for a General Graph 

In the case of a more general call graph $\mathcal{G}$, it is not possible 
to directly convert the call graph to a clique tree as done above 
for a call tree. 

We outline here two solutions to this problem. First, assume 
that the call graph is such that, by removing a small 
subset $\mathcal{V}_s$ of nodes, one can partition the graph into subtrees. 
This is the case for typical graphs, such as that in Fig. 1, with 
a small number of “map” and “reduce” nodes (see Sec. I). For 
such graphs, similar to an observation made in prior work, one can apply 
the message passing scheme introduced above on each subtree 
for all possible instantiations of the offloading decisions for 
the mentioned fixed nodes. Then, the minimum value of the 
function in (9) is calculated over all such instantiations. The 
complexity of this approach is of the order $O(2^{|\mathcal{V}_s|} |\mathcal{V}| d_{\mathrm{in}})$. 

For graphs with an even more general structure, the junction 
tree algorithm can be applied to obtain a clique tree [17, Ch. 
10]. Once the clique tree is obtained, message passing can 
be implemented by extending the approach described in the 
previous subsection. The complexity of this scheme depends 
on the treewidth of the graph [17]. In general, unless $|\mathcal{V}_s|$ is 
prohibitively large, the previous approach is to be preferred 
due to the possibility of reusing the efficient algorithm in Table I. 


V. Optimization of Task Offloading for Parallel 
Processing 


In this section, we tackle the problem [P.2] in the presence 
of parallel processing. As for the serial case, we concentrate 
on call trees in Sec. V-A and in Sec. V-B we discuss the 
extensions to more general call graphs. 

As explained in Sec. III, in order to evaluate energy and 
latency of a parallel implementation, one needs to keep track 
of the number of concurrent processes that use the local and 
remote CPUs as well as the uplink and downlink bandwidth. 
While the dynamic model presented in the Appendix is able 
to do so, its use for optimization appears challenging. Hence, 
in this section, in order to develop a useful optimization 
heuristic, we assume that the numbers of concurrent uploads, 
downloads, local computations and remote computations are 
fixed. Under this simplifying assumption, we propose an 
algorithm that solves problem [P.2] to any arbitrary precision 
with linear complexity via message passing, and, specifically, 
via dynamic programming. The performance of the obtained 
heuristic solution is then evaluated by means of the dynamic 
model described in the Appendix. 

To elaborate, we fix the number of concurrent upload and 
download transmissions to $N^{\mathrm{ul}}$ and $N^{\mathrm{dl}}$, respectively, and the 
number of concurrently computed tasks locally or remotely as 
$N^l$ and $N^r$, respectively. The fixed values of $N^{\mathrm{ul}}$, $N^{\mathrm{dl}}$, $N^l$ 
and $N^r$ define parameters that can be set by the designer, 
yielding different optimization solutions that can be evaluated 
via the dynamic model in the Appendix. More discussion on 
the selection of these parameters can be found in Sec. VI. 

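Having fixed the mentioned parameters, the optimization 
proceeds as follows. To start, the available uplink and 
downlink capacities are obtained as 

$$C^{\mathrm{ul}}_{\mathrm{par}}(P^{\mathrm{ul}}_{m,n}) = \frac{B}{N^{\mathrm{ul}}} \log_2\left(1 + \frac{\gamma N^{\mathrm{ul}} P^{\mathrm{ul}}_{m,n}}{N_0 B}\right) \qquad (13a)$$

and 

$$C^{\mathrm{dl}}_{\mathrm{par}} = \frac{B}{N^{\mathrm{dl}}} \log_2\left(1 + \left(2^{C^{\mathrm{dl}}/B} - 1\right) N^{\mathrm{dl}}\right), \qquad (13b)$$

which correspond to the rates achievable when the spectral 
resources, either in time or in frequency, are equally 
divided into $N^{\mathrm{ul}}$ and $N^{\mathrm{dl}}$ parts, respectively. Similarly, the 
frequencies of the local and the remote processors can be 
obtained as 

$$f^l_{\mathrm{par}} = \frac{f^l}{N^l} \quad \text{and} \quad f^r_{\mathrm{par}} = \frac{f^r}{N^r}. \qquad (14)$$

Following [7], [8], we start by observing that, for each task $T_n$, 
the delay required to complete the tasks of the subtree in $\mathcal{G}$ 
rooted at any task node $T_n$ can be calculated recursively, given 
that the completion of task $T_n$ requires completion of all the 
parent tasks. Specifically, the time $L^{(n)}_{\mathrm{par}}(\mathbf{I}, \mathbf{P})$ by which the 
subtree rooted at $T_n$ is completed, given the decisions $(\mathbf{I}, \mathbf{P})$, 
can be written in terms of the same quantities for its parents 
as 

$$L^{(n)}_{\mathrm{par}}(\mathbf{I}, \mathbf{P}) = \max_{m \in \mathcal{P}(n)} \left\{ L^{(m)}_{\mathrm{par}}(\mathbf{I}, \mathbf{P}) + L^{\mathrm{ul}}_{m,n}(I_{\{m,n\}}, P^{\mathrm{ul}}_{m,n}) + L^{\mathrm{dl}}_{m,n}(I_{\{m,n\}}) \right\} + L_n(I_n), \qquad (15)$$

where $L^{(m)}_{\mathrm{par}}(\mathbf{I}, \mathbf{P})$ is the latency of the subtree rooted at 
the parent node $T_m$ and the latency terms are defined as in (2), 
evaluated with the capacities in (13) and the processor frequencies in (14). 
Note that since $I_n = 0$ for the leaf nodes in $\mathcal{V}_d$, we 
have $L^{(n)}_{\mathrm{par}}(\mathbf{I}, \mathbf{P}) = 0$ for $n \in \mathcal{V}_d$. The expression (15) can 
then be calculated recursively starting from the leaf nodes, and 
the final delay is given by $L_{\mathrm{par}}(\mathbf{I}, \mathbf{P}) = L^{(|\mathcal{V}|)}_{\mathrm{par}}(\mathbf{I}, \mathbf{P})$. 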
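
The recursive evaluation of the completion time of a subtree can be sketched as follows. This is a minimal illustration under simplifying assumptions: the uplink and downlink rates are taken as fixed constants (`c_ul`, `c_dl`), leaf nodes contribute zero latency as stated in the text, and the function name, the graph encoding and the numerical values are hypothetical.

```python
def parallel_latency(parents, root, I, l_loc, l_rem, bits, c_ul, c_dl):
    """Completion time of the call tree under the parallel recursion."""
    def finish(n):
        if not parents.get(n):
            return 0.0   # leaf data nodes are assigned zero latency, as in the text
        own = (1 - I[n]) * l_loc[n] + I[n] * l_rem[n]
        waits = []
        for m in parents[n]:
            b = bits[(m, n)]
            t_ul = I[n] * (1 - I[m]) * b / c_ul   # upload if parent local, child remote
            t_dl = (1 - I[n]) * I[m] * b / c_dl   # download if parent remote, child local
            waits.append(finish(m) + t_ul + t_dl)
        # The task starts only when the slowest parent branch has delivered its data.
        return max(waits) + own
    return finish(root)

# Hypothetical tree: T1, T2 -> T3 (remote) -> T4 (local root).
parents = {3: [1, 2], 4: [3]}
I = {1: 0, 2: 0, 3: 1, 4: 0}
l_loc = {1: 0.1, 2: 0.1, 3: 5.0, 4: 1.0}
l_rem = {1: 0.01, 2: 0.01, 3: 0.5, 4: 0.1}
bits = {(1, 3): 2e6, (2, 3): 3e6, (3, 4): 1e6}
latency = parallel_latency(parents, 4, I, l_loc, l_rem, bits, c_ul=1e6, c_dl=1e6)
```

Here the two uploads overlap, so the remote task waits 3 s rather than 5 s, and the total is 5.5 s; running the same operations one after another would take 7.5 s.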


A. Message Passing for a Call Tree 

In order to develop an approximate solution to problem 
[P.2] under the said assumptions (see (13)-(14)), as in [7], [8], 
we partition the set of possible delays into $K$ intervals by 
means of the quantization function 

$$q(t) = t_k \quad \text{if } t \in (t_{k-1}, t_k], \qquad (16)$$

where $0 < t_1 < t_2 < \cdots < t_K = L_{\max}$ are given predefined 
latency values. We take for simplicity $t_k = (k - 1)\epsilon$ for 
a given quantization step $\epsilon > 0$. The algorithm presented 
below provides an approximation of the optimal solution of 
the program at hand, which, following the same arguments as 
in [7], [8], becomes increasingly accurate as $\epsilon$ becomes smaller. 
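
The quantization of delays onto a uniform grid can be sketched as below. This is an illustrative helper, not part of the original algorithm description: the names `make_grid` and `quantize` are hypothetical, the maximum delay is assumed to be an integer multiple of the step, and delays beyond the last grid point are mapped to infinity to flag infeasibility.

```python
import math
from bisect import bisect_left

def make_grid(l_max, eps):
    """Uniform grid with step eps whose last point equals l_max."""
    num_points = round(l_max / eps) + 1
    return [k * eps for k in range(num_points)]

def quantize(t, grid):
    """Round t up to the nearest grid value; infinity if t exceeds the grid."""
    idx = bisect_left(grid, t)   # index of the first grid value >= t
    return grid[idx] if idx < len(grid) else math.inf

grid = make_grid(1.5, 0.5)       # grid points 0.0, 0.5, 1.0, 1.5
q_mid = quantize(0.7, grid)      # lies in (0.5, 1.0], so it is rounded up to 1.0
q_edge = quantize(1.0, grid)     # a grid point is mapped to itself
q_over = quantize(1.6, grid)     # exceeds the maximum delay
```

Rounding delays up means a quantized schedule never violates the true deadline, at the price of an error of at most one step per quantized term, which is why a smaller step gives a more accurate solution.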

We define $\mathcal{T}_n$ as the subtree of $\mathcal{G}$ that is rooted at the task 
$T_n$. Moreover, we let $E^l(n,k)$ denote the minimum energy 
needed to run the tasks in $\mathcal{T}_n$ if node $T_n$ is executed 
locally and under the constraint that the latency is less than 
$t_k$. Note that the energy $E^l(n,k)$ is minimized with respect 
to the offloading variables in vector $\mathbf{I}$ corresponding to the 
task nodes in the mentioned subtree except $T_n$, as well as 
over the uplink powers in vector $\mathbf{P}$ corresponding to all the 
edges within the subtree. Similarly, we define $E^r(n,k)$ as the 
minimum energy cost for $\mathcal{T}_n$ if $T_n$ is performed remotely 
and under the delay constraint $t_k$. We also correspondingly 
define the set $\mathcal{I}^l(n,k) = \{I^l_m(n,k)\}_{m \in \mathcal{P}(n)}$ that contains the 
optimum offloading decisions for the parent nodes of node 
$T_n$ if the latter is performed locally under the latency $t_k$ for 
the subtree rooted at $T_n$. Similarly, we define $\mathcal{I}^r(n,k) = \{I^r_m(n,k)\}_{m \in \mathcal{P}(n)}$ 
as the set containing the optimum decisions 
for the parent nodes of node $T_n$, if the latter is performed 
remotely with the latency constraint $t_k$. 
The proposed dynamic programming algorithm computes 
the cost functions $E^l(n,k)$ and $E^r(n,k)$ and the sets $\mathcal{I}^l(n,k)$ 
and $\mathcal{I}^r(n,k)$ recursively from the energy cost functions 
$E^l(m,j)$ and $E^r(m,j)$ of all the parent nodes $m \in \mathcal{P}(n)$ 
under all the delay constraints $t_j$ with $j = 1, \ldots, k - 1$. 
Specifically, we set $E^l(n,k) = \infty$ and $E^r(n,k) = \infty$ for 
$k < 0$. We can then obtain the recursive relationship 

$$E^l(n,k) = P^l L_n^l + \sum_{m \in \mathcal{P}(n)} \min\left\{ E^l\left(m, k - Q(L_n^l)\right), \; E^r\left(m, k - Q\left(L_n^l + \frac{b_{m,n}}{C^{\mathrm{dl}}_{\mathrm{par}}}\right)\right) + (P^{\mathrm{tx}} + P^{\mathrm{rx}}) \frac{b_{m,n}}{C^{\mathrm{dl}}_{\mathrm{par}}} \right\}, \qquad (17)$$

where the function $Q$ is defined as $Q(t) = k$ if $t \in [t_{k-1}, t_k)$ 
for all $k \in \{1, \ldots, K\}$. 

Equation (17) accounts for the fact that the minimum energy 
cost required to run the tasks in the subtree $\mathcal{T}_n$ within a 
latency $t_k$ if $T_n$ is run locally is given by the sum of the 
local processing energy (see $E_n(I_n)$ in (3)) and of the 
energies required to run all the subtrees $\mathcal{T}_m$ with $m \in \mathcal{P}(n)$. 
For the latter, each parent node $T_m$ can be run either locally, 
requiring energy $E^l(m, k - Q(L_n^l))$, or remotely, with an 
energy $E^r(m, k - Q(L_n^l + b_{m,n}/C^{\mathrm{dl}}_{\mathrm{par}})) + (P^{\mathrm{tx}} + P^{\mathrm{rx}}) b_{m,n}/C^{\mathrm{dl}}_{\mathrm{par}}$. 
We observe that, if node $T_m$ 
is performed locally, the latency allowed for the subtree $\mathcal{T}_m$ is 
$t_k - q(L_n^l)$ and hence the corresponding minimum energy is 
$E^l(m, k - Q(L_n^l))$, and similarly for the case in which $T_m$ is 
carried out remotely the energy can be calculated as in (17). 


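$$E^r(n,k) = \sum_{m \in \mathcal{P}(n)} \min\left\{ \frac{b_{m,n}}{C^{ul}_{par}(P^{ul}_{m,n,k})}\big(P^{ul}_{m,n,k} + P^{tx}\big) + E^l\Big(m,\, k - Q\Big(L^r_n + \frac{b_{m,n}}{C^{ul}_{par}(P^{ul}_{m,n,k})}\Big)\Big),\; E^r\big(m,\, k - Q(L^r_n)\big) \right\}, \qquad (18)$$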


explained in an analogous fashion as for (17). Furthermore,


TABLE II

Dynamic Programming Solution for Parallel Implementation

1: for $n = 1 : |\mathcal{V}|$ do
      if $T_n \in \mathcal{V}_d$
         $E^l(n,k) = 0$ for all $k$
         $E^r(n,k) = \infty$ for all $k$
      else
         for $k = 1, \dots, K$ do
            Calculate the powers $P^{ul}_{m,n,k}$ for all $(m,n) \in \mathcal{E}$ using (19).
            Update $E^l(n,k)$, $E^r(n,k)$, $\mathcal{I}^l(n,k)$ and $\mathcal{I}^r(n,k)$ by using (17)-(18).
2: Trace back the optimum decisions from $E^l(|\mathcal{V}|, K)$ using the
   algorithm in Table III.


must be performed locally within the delay constraint $L_{max}$,
the optimum solution $(\mathbf{I},\mathbf{P})$ can be found starting from the
optimal decisions associated with the root node by keeping
track of the maximum allowed delay for each subtree $\mathcal{T}_n$.
The complete dynamic programming algorithm is
presented in Table II and the backtracking method is explained
in Table III.

In (17), the $\min\{\cdot,\cdot\}$ operation accounts for the choice of
whether node $T_m$ should be performed locally or remotely.
Accordingly, the set $\mathcal{I}^l(n,k) = \{I^l_m(n,k)\}_{m \in \mathcal{P}(n)}$ can be
evaluated during calculation of $E^l(n,k)$ in (17) by observing
which term in the function $\min\{\cdot,\cdot\}$ is smaller. Specifically,
we can write $I^l_m(n,k) = 0$ if the first term is smaller and
$I^l_m(n,k) = 1$ otherwise. Similar to (17), we can also write the
recursion (18) for $E^r(n,k)$.

Optimization of the powers is carried out by observing
that, thanks to the decomposition made possible by dynamic
programming, the powers $P^{ul}_{m,n,k}$ appear in separate terms
in (18). Therefore, without loss of optimality, the powers
$P^{ul}_{m,n,k}$ can be optimized separately for each term in (18).
This optimization is complicated by the presence of the
non-differentiable term $Q(L^r_n + b_{m,n}/C^{ul}_{par}(P^{ul}_{m,n,k}))$. To address this issue,
for each $(m,n) \in \mathcal{E}$ and each $k \in \{1, \dots, K\}$ we calculate

$$P^{ul}_{m,n,k} = \arg\min_{P_{m,n}} E^r(n,k,P_{m,n}), \qquad (19)$$

where




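$$E^r(n,k,P_{m,n}) = \frac{b_{m,n}}{C^{ul}_{par}(P_{m,n})}\big(P_{m,n} + P^{tx}\big) + E^l\Big(m,\, k - Q\Big(L^r_n + \frac{b_{m,n}}{C^{ul}_{par}(P_{m,n})}\Big)\Big), \qquad (20)$$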


by solving $k - Q(L^r_n) + 1$ convex subproblems. To this end, we
note that the equality $Q(L^r_n + b_{m,n}/C^{ul}_{par}(P_{m,n})) = j$ holds
as long as the inclusion $P_{m,n} \in \mathcal{P}_{m,n,j}$ is satisfied with


where the uplink power $P^{ul}_{m,n,k}$ is selected as detailed below. The two
arguments of the $\min\{\cdot,\cdot\}$ operator measure the energy
cost of the subtree $\mathcal{T}_m$ in the case that the parent node
$T_m$ is performed locally or remotely, respectively, and are


P,n,n,j = - Ij /y, _ ij /yj ^ 

( 21 ) 

where $\gamma'$ is a normalized version of the channel gain $\gamma$. We can then calculate $P^{ul}_{m,n,k}$
in (19) by first solving the problems (22)
for all $j \in \{Q(L^r_n), \dots, k\}$ and then setting $P^{ul}_{m,n,k}$ as in (23).


the set $\mathcal{I}^r(n,k) = \{I^r_m(n,k)\}_{m \in \mathcal{P}(n)}$ can be evaluated during
calculation of $E^r(n,k)$ in analogous fashion as $\mathcal{I}^l(n,k)$.
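The bottom-up structure of the recursions (17)-(18) can be illustrated with a simplified sketch. It rests on assumptions that are not in the text: all durations are given directly as integer numbers of quantization steps, the uplink power of each edge is fixed in advance (so the power optimization in (19) is skipped), and the function name `tree_dp` and all argument names are hypothetical.

```python
import math

INF = math.inf

def tree_dp(parents, leaves, d_loc, e_loc, d_rem, ul, dl, K):
    """Bottom-up energy tables for a call tree with nodes 0..N-1 in topological order.

    parents[n]   : list of parent (predecessor) nodes feeding node n
    leaves       : set of input-data nodes available at the mobile
    d_loc, e_loc : local processing duration (slots) and mobile energy per node
    d_rem        : remote processing duration (slots) per node
    ul, dl       : dicts (m, n) -> (duration in slots, mobile energy) per edge
    Returns (E_l, E_r), where E_x[n][k] is the minimum mobile energy needed to
    finish the subtree rooted at n within k slots, with n run locally (l) or remotely (r).
    """
    N = len(parents)
    E_l = [[INF] * (K + 1) for _ in range(N)]
    E_r = [[INF] * (K + 1) for _ in range(N)]

    def get(table, m, k):
        # A negative slot budget is infeasible
        return table[m][k] if k >= 0 else INF

    for n in range(N):
        if n in leaves:
            E_l[n] = [0.0] * (K + 1)   # input data is local at no cost
            continue                   # E_r stays infinite for leaves
        for k in range(K + 1):
            local, remote = e_loc[n], 0.0
            for m in parents[n]:
                t_dl, e_dl = dl[(m, n)]
                t_ul, e_ul = ul[(m, n)]
                # n local: parent m is either local, or remote followed by a downlink
                local += min(get(E_l, m, k - d_loc[n]),
                             get(E_r, m, k - d_loc[n] - t_dl) + e_dl)
                # n remote: parent m is either local followed by an uplink, or remote
                remote += min(e_ul + get(E_l, m, k - d_rem[n] - t_ul),
                              get(E_r, m, k - d_rem[n]))
            E_l[n][k], E_r[n][k] = local, remote
    return E_l, E_r

# Toy chain: leaf 0 -> task 1 -> root 2 (all numbers are made up for illustration)
parents = [[], [0], [1]]
ul = {(0, 1): (1, 5.0), (1, 2): (1, 5.0)}
dl = {(0, 1): (1, 0.5), (1, 2): (1, 0.5)}
E_l, E_r = tree_dp(parents, {0}, [0, 4, 1], [0.0, 4.0, 1.0], [0, 1, 1], ul, dl, K=6)
print(E_l[2][3], E_l[2][4], E_l[2][5])
```

In this toy instance, a budget of 3 slots is infeasible, a budget of 4 slots forces task 1 to be offloaded at a higher energy, and a budget of 5 slots allows the cheaper all-local execution, which mirrors the energy-latency trade-off that the recursion is designed to capture.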

Once equations (17)-(18) are evaluated starting from the
leaf nodes of $\mathcal{G}$ to the root, the optimum powers $\mathbf{P}$ and
offloading decisions $\mathbf{I}$ are obtained via backtracking from the
root to the leaves of $\mathcal{G}$. Specifically, since the root node


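$$\tilde{P}_{m,n,j} = \arg\min_{P_{m,n} \in \mathcal{P}_{m,n,j}} \frac{b_{m,n}}{C^{ul}_{par}(P_{m,n})}\big(P_{m,n} + P^{tx}\big), \qquad (22)$$

$$P^{ul}_{m,n,k} = \arg\min_{\tilde{P}_{m,n,j}:\, j \in \{Q(L^r_n), \dots, k\}} \frac{b_{m,n}}{C^{ul}_{par}(\tilde{P}_{m,n,j})}\big(\tilde{P}_{m,n,j} + P^{tx}\big) + E^l\Big(m,\, k - Q\Big(L^r_n + \frac{b_{m,n}}{C^{ul}_{par}(\tilde{P}_{m,n,j})}\Big)\Big). \qquad (23)$$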
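TABLE III

Backtracking algorithm for Table II

1: Set $L_{|\mathcal{V}|} = L_{max}$ and $I_{|\mathcal{V}|} = 0$
2: for $n = |\mathcal{V}| : 1$ do
      for all $m \in \mathcal{P}(n)$ do
         if $I_n = 0$
            if $I^l_m(n, Q(L_n)) = 0$
               Set $I_m = 0$ and $L_m = L_n - L^l_n$
            else
               Set $I_m = 1$ and $L_m = L_n - L^l_n - b_{m,n}/C^{dl}_{par}$
         else
            if $I^r_m(n, Q(L_n)) = 0$
               Set $I_m = 0$, $P_{m,n} = P^{ul}_{m,n,Q(L_n)}$
               and $L_m = L_n - L^r_n - b_{m,n}/C^{ul}_{par}(P_{m,n})$
            else
               Set $I_m = 1$ and $L_m = L_n - L^r_n$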


Each problem (22) becomes convex by means of the change
of variable $y_{m,n} = C^{ul}_{par}(P_{m,n})$.
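The effect of this change of variable can be illustrated numerically. The sketch below assumes a Shannon-type uplink rate $C(P) = B \log_2(1 + \gamma' P)$, which is an assumption made here because the rate expression is defined outside this section; the function names and the numerical values are likewise illustrative. Under this assumption, the transmission energy written as a function of the rate $y$ is $(b/y)\,((2^{y/B} - 1)/\gamma' + P^{tx})$, which is convex for $y > 0$.

```python
import math

def uplink_rate(p, bandwidth, gain):
    """Assumed Shannon-type rate in bits/s for transmit power p (W)."""
    return bandwidth * math.log2(1.0 + gain * p)

def power_for_rate(y, bandwidth, gain):
    """Inverse of uplink_rate: power needed to sustain a rate of y bits/s."""
    return (2.0 ** (y / bandwidth) - 1.0) / gain

def uplink_energy(y, bits, bandwidth, gain, p_tx=0.0):
    """Mobile energy to send `bits` at rate y, including a fixed circuit power p_tx."""
    return (bits / y) * (power_for_rate(y, bandwidth, gain) + p_tx)

# Illustrative values: 1 MHz bandwidth, 27 dB gain-to-noise ratio, 1 Mbit payload
B_HZ = 1e6
GAIN = 10 ** 2.7
BITS = 1e6

# Midpoint check of convexity in the rate variable y
y_a, y_b = 0.5e6, 4e6
e_mid = uplink_energy(0.5 * (y_a + y_b), BITS, B_HZ, GAIN)
e_avg = 0.5 * (uplink_energy(y_a, BITS, B_HZ, GAIN) + uplink_energy(y_b, BITS, B_HZ, GAIN))
```

Working in the rate domain also turns a latency budget into a simple lower bound on $y$, since sending $b$ bits within a time $T$ requires $y \geq b/T$.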

Since the maximum number of convex optimizations that
need to be solved at each time instant for each node can be
upper bounded by $d_{in} K$, and $K$ is proportional to $1/\epsilon$, the
complexity of the proposed algorithm in Table II is given by
$O(|\mathcal{V}| d_{in}/\epsilon^2)$.


B. Message Passing for a General Call Graph 

Similar to Sec. IV-C, for a graph with the structure discussed
there, the problem P.2 can be solved, for fixed parameters
$N^l$, $N^r$, $N^{ul}$ and $N^{dl}$, by means of an exhaustive search over
the offloading decisions of the nodes that, when removed, decompose
the graph into disjoint trees. Following the discussion
in Sec. IV-C, the resulting solution has a complexity of order
$O(2^{s} |\mathcal{V}| d_{in}/\epsilon^2)$, where $s$ is the number of such nodes.
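The outer exhaustive search can be sketched as follows. The routine `solve_forest` is a hypothetical, caller-supplied function that would return the minimum energy of the remaining disjoint trees once the decisions of the removed nodes are fixed (for instance, by running the dynamic program on each tree); the toy cost used in the example is made up for illustration only.

```python
from itertools import product

def search_over_cut(cut_nodes, solve_forest):
    """Try every 0/1 offloading assignment of the cut nodes and keep the cheapest.

    solve_forest(fixed) is assumed to return the minimum energy of the remaining
    trees given the dict `fixed` mapping each cut node to its decision.
    """
    best_cost, best_fixed = float("inf"), None
    for bits in product((0, 1), repeat=len(cut_nodes)):
        fixed = dict(zip(cut_nodes, bits))
        cost = solve_forest(fixed)
        if cost < best_cost:
            best_cost, best_fixed = cost, fixed
    return best_cost, best_fixed

# Toy stand-in for the tree solver, counting how often it is called
calls = []
def toy_solver(fixed):
    calls.append(dict(fixed))
    return 3.0 + 2.0 * fixed["a"] - 1.5 * fixed["b"]

best_cost, best_fixed = search_over_cut(["a", "b"], toy_solver)
```

The solver is invoked once per assignment, that is, $2^s$ times for $s$ cut nodes, which is the source of the exponential factor in the complexity expression.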


VI. Simulation Results 


In this section, we provide some numerical examples based
on the analysis developed in the previous sections. We start
by considering the call tree in Fig. 4 in order to simplify
the interpretation of the results and gain insight into the
performance of the considered techniques. In this example,
tasks $T_{13}, \dots, T_{24}$ process input data present at the mobile device,
represented by nodes $\mathcal{V}_d = \{T_1, \dots, T_{12}\}$, e.g., to extract some
features, and then the root node $T_{25}$ performs a "reduce" operation,
such as classification, on the extracted features at the
mobile ($I_{25} = 0$). We set $P^l = 0.4$ W, which is a common
value for smartphones; $f^l = 10^9$ CPU cycles/s (e.g., the
Apple iPhone 6 processor has a maximum clock rate of 1.4
GHz); $f^r = 10^{10}$ CPU cycles/s (e.g., the AMD FX-9590 has a
clock rate of 5 GHz); $\gamma/(BN_0) = 27$ dB, $P^{tx} = 0$ W,
$P^{rx} = 0$ W, $B = 1$ MHz, $C^{dl} = 200$ Mbits/s unless stated
otherwise. For both the serial implementation (solid lines)
and the parallel implementation (dashed lines), optimization
is performed according to the algorithms described in Sec. IV
and Sec. V, respectively, and, for the parallel implementation,
the performance is evaluated using the dynamic model presented
in the Appendix with step size $\epsilon_d = 0.1$. For parallel
optimization, we set $N^{ul} = N^{dl} = N^l = N^r$ in (13) and
(14) to an optimized value in the range [1,4] and we have
$\epsilon = 0.1$. Note that the performance of the optimization was
found not to be significantly improved with smaller values
of $\epsilon$ and not to be increased by choosing larger values for
$N^{ul} = N^{dl} = N^l = N^r$.


In Fig. 5, the mobile energy costs for the serial and the
parallel implementations are plotted versus the latency, along
with their communication and computation components, for
the graph in Fig. 4 with the selection of parameters marked
as case (a) in the caption of Fig. 4. The parameters of the
graph are chosen to yield the same range of latencies and
energy consumptions as in the prior art. With the selected
parameters, performing the application locally requires an
energy equal to 65.6 J and has a latency of 164 s (outside
the range of Fig. 5). Fig. 5 shows that significantly smaller
latencies and energy expenditures can be obtained by properly
optimizing the offloading decisions and the communication
strategy. For instance, with an energy expenditure of 6.5 J, an
optimized parallel implementation yields a latency of around
20 s, while an optimized serial implementation requires a
latency of around 45 s.
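The all-local operating point quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the local energy is simply the local processing power multiplied by the local processing time, and it reads the local CPU frequency as $10^9$ cycles/s; the variable names are illustrative.

```python
# Values quoted in the text for the all-local execution
P_LOCAL_W = 0.4          # local processing power in watts
LOCAL_LATENCY_S = 164.0  # latency of the all-local execution in seconds
F_LOCAL_HZ = 1e9         # local CPU frequency, read here as 1e9 cycles/s (assumption)

# Energy = power x time, under the assumption stated above
local_energy_j = P_LOCAL_W * LOCAL_LATENCY_S

# Total workload implied by the latency and the assumed clock frequency
total_cycles = F_LOCAL_HZ * LOCAL_LATENCY_S

print(round(local_energy_j, 1), total_cycles)
```

The product of 0.4 W and 164 s matches the reported 65.6 J, so the two figures are mutually consistent under this simple energy model.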

The parallel implementation is shown here to have the 
potential to strictly outperform the serial implementation and 
to enable the operation at latencies that are unattainable with 
the serial implementation. Moreover, as the latency increases, 
the energy can be seen to decrease mostly due to the fact that 
the communication powers can be reduced. An exception to 
this trend is observed for the serial implementation around the 
latency L = 42 s, due to the fact that the optimum application 
layer decisions prescribe more tasks to be offloaded for L > 42 
s. 


In order to provide a further reference performance for inter-layer
optimization, we consider a conventional separate design
strategy, whereby: (i) the uplink transmission power for each
task is obtained by imposing the constraint that transmitting
in the uplink requires a time no larger than that necessary
to perform that task locally (see Sec. 3 of the cited work for a similar
approach); (ii) the optimization of the offloading decisions is
carried out by following the proposed algorithms with a fixed
physical layer, which amounts to the schemes in the prior art for
the parallel implementations. For the serial implementation,
this separate approach yields a latency of 178 s and an energy
expenditure of 9.7 J, which is outside the range of Fig. 5, while
for parallel processing the observed energy-latency point is
illustrated in this figure. Note that separate optimization does
not attempt to adapt the physical layer to the application layer
requirements and hence it yields a single energy-latency point
in the considered latency range.

Fig. 6 shows the energy-latency trade-off for the call graph
in Fig. 4 for both case (a) and case (b) as detailed in the caption
of Fig. 4. Note that the separate optimization for case (b) with
the parallel implementation yields $E = 22.5$ J for $L = 38.5$ s,
which is out of the range of Fig. 6. The results in Fig. 6
suggest that the gains offered by the parallel implementation
over the serial implementation depend strongly on the chosen
call graph.

To gain more insight into this point, Fig. 7 illustrates the
timeline corresponding to the parallel implementation for case
Fig. 4. The call tree graph used for the examples in Figs. 5-7. The numbers
shown next to the edges that are connected to the input task nodes represent the
sizes of input bits $b_{m,n}$ in Mbits and the numbers in the task nodes (circles)
represent the number of CPU cycles v„ normalized by 10® CPU cycles (empty 
circles with ui = ... = vi2 = 0). The remaining values for case (a) are: 
^13,25 = 7.3 X 10®, bl4,25 = 1-4 X 10®, bl5,25 = 1-4 X 10®, bl6,25 = 
1.4 X lO’^ bits, bl 7 ,l 3 = ^21,25 = fel3,25> ^>18,25 = ^22,25 = bi4_25> 
bi9,25 = ^23,25 = ^15,13 and 1)20,25 = 1)24,25 = 1)16,25- In case (b), all 
the parameters are the same as case (a) except for l) 3 _i 5 = 1 ) 4 ,le = 1 ) 7,19 = 
1)8,20 = 1)11,23 = 1)12,24 = 11-4 Mbits, 1)14,25 = 1)15,25 = 1)16,25 = 
1)18,25 = 1)19,25 = 1)20,25 = 1)22,25 = 1)23,25 = 1)24,25 = 14.6 X 10^ 
bits, 1)13,25 = 1)17,25 = 1)21,25 = 7.3 X IQl bits and U 15 = Dig = 023 = 
4.6 X 10®, 1)16 = 1)20 = 1)24 = 3.6 X 10® and 1)25 = 3.42 X 10® CPU 
cycles. 

(a) and case (b) for $L = 20$ s. Here, we use the same
definitions of {ID, CP$^l$, CP$^r$, UL, DL} as before. It can be
seen that in case (a), several communication and computation
operations take place in parallel for a significant fraction of the
time, and hence the parallel implementation is advantageous
as compared to the serial implementation. Instead, for case (b)
most of the time is spent on uplink transmissions and hence
the opportunities for parallel processing are much reduced.

In order to complement the insight obtained from the study
of the call graph in Fig. 4, here we elaborate on the impact
of the structure of the call graph by considering the graph
in Fig. 1. We plot the performance of the serial and parallel
implementations for the call graph $\mathcal{G}$ as well as for the
subtrees $\mathcal{T}_1$ and $\mathcal{T}_2$ in Fig. 8. The relative values of the
parameters in the call graph $\mathcal{G}$ are obtained from the prior art, and
their exact values are defined in the caption of this figure. As
expected, the energy required to run the application for a given
latency increases as one considers a larger call graph. More
importantly, the opportunities for concurrent computations and
communications are enhanced on larger subgraphs, and, as
a result, for $\mathcal{T}_2$ and $\mathcal{G}$, parallel processing provides a more
substantial gain over the serial implementation than in $\mathcal{T}_1$.

VII. Concluding Remarks 

In this paper, we studied the inter-layer optimization of
cloud mobile computing systems over the power allocation at
the physical layer and offloading decisions at the application
layer with the aim of exploring the achievable trade-offs between
the mobile energy expenditure and latency. Unlike prior
work in which the problem is formulated as a mixed integer
program, here we proposed a message-passing framework that
leverages the typical structure of call graphs to drastically
reduce complexity. In particular, we focused on call graphs
that can be decomposed into a combination of a small number
of subtrees when fixing the decisions of a subset of nodes,
obtaining a complexity that grows exponentially only in the
size of such a set of nodes rather than in the size of the call graph.
Moreover, unlike prior art, the framework is applied to both the


Fig. 5. Energy and latency trade-off for the call graph $\mathcal{G}$ in Fig. 4 (case
(a)). The program can be completely performed locally with $E = 65.6$ J and
$L = 164$ s. Moreover, separate optimization for the serial implementation yields
$E = 9.7$ J and $L = 178$ s.

conventional serial implementation and a parallel implementation
that enables the concurrent scheduling of communication
and computation. Via simulation results, we demonstrated the 
impact of the call graph structure on the relative performance 
of the parallel and serial implementations, and shed light on 
the impact of inter-layer optimization. 
Appendix

In Sec. V, we proposed an analytically convenient approximation
for the energy and latency of the parallel implementation.
Here, we develop a dynamic model that enables the
evaluation of upper bounds on the energy and latency of the
parallel implementation for a fixed set of variables $(\mathbf{I},\mathbf{P})$ by
tracking the state of each task over time. To this end, we
quantize the time axis as in Sec. V with a generally different
time step $\epsilon_d$. By construction, the upper bounds calculated
here become increasingly tight as the quantization step $\epsilon_d$
decreases.

Define as $X_n(k)$ the state of task node $T_n$ at time instant
$t_k = (k-1)\epsilon_d$. The state of each node remains constant
in the time range $(t_k, t_{k+1}]$ and may take any value in the
set {ID, CM, CP$^l$, CP$^r$, UL, DL}, where ID indicates that a
task is idle in the sense that it has not started processing
yet. Instead, CM indicates that a task is completed in terms
of processing and uplink/downlink communication, and the other
states are defined in Sec. III-B. For all $n \in \mathcal{V}_d$, we initialize
the state as $X_n(1) = \mathrm{CP}^l$.

To keep track of the state of the uplink and downlink
transmissions, we define the following variables. The variable
$b^{ul}_n(k)$ indicates the remaining information bits that task $T_n$
still needs to send in the uplink at time $t_k$. For $k = 1$, we
have $b^{ul}_n(1) = b_{n,c(n)}$ for tasks $T_n$ that are not directly
connected to a leaf node with $I_n = 0$ and $I_{c(n)} = 1$; instead,
if $I_n = 1$ and $\mathcal{P}(n) \in \mathcal{V}_d$, we set $b^{ul}_n(1) = b_{\mathcal{P}(n),n}$ and we
Fig. 6. Energy and latency trade-off for the call graph $\mathcal{G}$ in Fig. 4 for case
(a) and case (b). Separate optimization for the parallel implementation yields
$E = 22.5$ J and $L = 38.5$ s for case (b) (not shown).


have $b^{ul}_n(1) = 0$ otherwise. Similarly, the variable $b^{dl}_{m,n}(k)$ for
$m \in \mathcal{P}(n)$ represents the remaining output bits of task $T_m$
that task $T_n$ needs to receive in the downlink at time $t_k$. For
$k = 1$, we have $b^{dl}_{m,n}(1) = b_{m,n}$ for all pairs $(m,n)$ such that
$I_n = 0$ and $I_m = 1$, and $b^{dl}_{m,n}(1) = 0$ otherwise.

In order to track the state of the tasks in terms of computations,
we define as $c^l_n(k)$ the number of CPU cycles that are
left at time $t_k$ to finish a task $T_n$ with $I_n = 0$, while $c^r_n(k)$
denotes the corresponding number of remaining CPU cycles
for a task $T_n$ with $I_n = 1$. Thus, we have $c^l_n(1) = v_n$ if $I_n = 0$
and $c^r_n(1) = v_n$ if $I_n = 1$, while we set $c^l_n(1) = c^r_n(1) = 0$
otherwise.

Let us define $N^l(k)$ as the number of tasks that are running
locally and $N^r(k)$ as the number of tasks that are running
remotely at time $t_k$. Similarly, we define $N^{ul}(k)$ and $N^{dl}(k)$
as the number of concurrent uplink and downlink transmissions
at time $t_k$, respectively. In the proposed approach, as
described below, we update the state $X_n(k)$ of each task node
by making the assumption that the quantities $N^l(k)$, $N^r(k)$,
$N^{ul}(k)$ and $N^{dl}(k)$ remain constant through the time interval
$(t_k, t_{k+1}]$. As argued below, this leads to the desired upper
bounds on energy and latency. In the following, we treat
separately the state update of each task $T_n$ in any interval
$(t_k, t_{k+1}]$ depending on the state $X_n(k)$ at time $t_k$.

If $X_n(k) = \mathrm{UL}$, the amount of information that can be
transmitted to the server in the time slot $(t_k, t_{k+1}]$ should be
calculated in order to update the variable $b^{ul}_n(k)$. If $I_n = 1$ we
have $b^{ul}_n(k+1) = [b^{ul}_n(k) - (C^{ul}(N^{ul}(k) P_{\mathcal{P}(n),n})/N^{ul}(k))\epsilon_d]^+$
due to the uploading of information from the connected leaf
node, where $[x]^+$ is equal to $x$ if $x > 0$ and is equal to 0
otherwise. Instead, if $I_n = 0$, we have $b^{ul}_n(k+1) = [b^{ul}_n(k) -
(C^{ul}(N^{ul}(k) P_{n,c(n)})/N^{ul}(k))\epsilon_d]^+$, due to the uploading of
information to the child task $T_{c(n)}$. As a result, the state of
the node changes as

$$X_n(k+1) = \begin{cases} \mathrm{UL} & \text{if } b^{ul}_n(k+1) > 0 \\ \mathrm{CM} & \text{if } I_n = 0 \text{ and } b^{ul}_n(k+1) = 0 \\ \mathrm{CP}^r & \text{if } I_n = 1 \text{ and } b^{ul}_n(k+1) = 0, \end{cases} \qquad (24)$$

since when $I_n = 0$, the task is completed, and when $I_n = 1$,
the task $T_n$ needs to be computed remotely.

Fig. 7. Timeline for the parallel implementation corresponding to the
optimum solution for $L = 20$ s for the call graph in Fig. 4 (see Fig. 6). Panels: Case (a) and Case (b); horizontal axis: latency.
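The uplink update and the transition rule (24) can be sketched for a single task and a single time slot as follows. The function name, the state labels and the argument `aggregate_rate`, which stands for the rate $C^{ul}(N^{ul}(k)P)$ shared by the concurrent uplink transmissions, are illustrative choices made here.

```python
def uplink_step(bits_left, aggregate_rate, n_ul, eps_d, offloaded):
    """Advance one slot of duration eps_d for a task currently in state UL.

    bits_left      : remaining uplink bits at the start of the slot
    aggregate_rate : rate shared equally by the n_ul concurrent uplink transmissions
    offloaded      : True if the task itself runs remotely (I_n = 1)
    Returns (remaining bits, next state).
    """
    share = aggregate_rate / n_ul
    # [x]^+ operation: the number of remaining bits cannot become negative
    remaining = max(bits_left - share * eps_d, 0.0)
    if remaining > 0:
        state = "UL"       # still uploading
    elif offloaded:
        state = "CP_r"     # input uploaded, remote computation starts
    else:
        state = "CM"       # local task has delivered its output
    return remaining, state

# Three slots for an offloaded task with 10 bits and a per-slot share of 4 bits
trace = []
bits = 10.0
for _ in range(3):
    bits, state = uplink_step(bits, aggregate_rate=8.0, n_ul=2, eps_d=1.0, offloaded=True)
    trace.append((bits, state))
```

Note that the last slot is occupied in full even though only half of it is needed to send the residual bits, which is the mechanism that makes the resulting energy and latency upper bounds.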

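Following similar considerations, if $X_n(k) = \mathrm{DL}$, the state
of the task node $T_n$ can be updated as

$$X_n(k+1) = \begin{cases} \mathrm{DL} & \text{if } b^{dl}_{m,n}(k+1) > 0 \text{ for any } m \in \mathcal{P}(n) \\ \mathrm{CP}^l & \text{if } b^{dl}_{m,n}(k+1) = 0 \text{ and } X_m(k) = \mathrm{CM} \text{ for all } m \in \mathcal{P}(n). \end{cases} \qquad (25)$$

Moreover, if $X_n(k) = \mathrm{CP}^l$, we have

$$X_n(k+1) = \begin{cases} \mathrm{CP}^l & \text{if } c^l_n(k+1) > 0 \\ \mathrm{UL} & \text{if } I_{c(n)} = 1 \text{ and } c^l_n(k+1) = 0 \text{ and } n \in \mathcal{V} \setminus \mathcal{V}_d \\ \mathrm{CM} & \text{otherwise}, \end{cases} \qquad (26)$$

where $c^l_n(k+1) = [c^l_n(k) - (f^l/N^l(k))\epsilon_d]^+$. Similarly, if $X_n(k) = \mathrm{CP}^r$, we
can write

$$X_n(k+1) = \begin{cases} \mathrm{CP}^r & \text{if } c^r_n(k+1) > 0 \\ \mathrm{CM} & \text{if } c^r_n(k+1) = 0, \end{cases} \qquad (27)$$

where $c^r_n(k+1)$ is calculated as $c^r_n(k+1) = [c^r_n(k) -
(f^r/N^r(k))\epsilon_d]^+$. If $X_n(k) = \mathrm{CM}$, we always have $X_n(k+1) = \mathrm{CM}$
and, if $X_n(k) = \mathrm{ID}$, we have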


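$$X_n(k+1) = \begin{cases} \mathrm{DL} & \text{if } I_n = 0 \text{ and } I_m = 1 \text{ for some } m \in \mathcal{P}(n) \text{ with } X_m(k) = \mathrm{CM} \\ \mathrm{UL} & \text{if } I_n = 1 \text{ and } X_m(k) = \mathrm{CM} \text{ for all } m \in \mathcal{P}(n) \text{ and } m \in \mathcal{V}_d \\ \mathrm{CP}^l & \text{if } I_n = 0 \text{ and } I_m = 0 \text{ for all } m \in \mathcal{P}(n) \text{ with } X_m(k) = \mathrm{CM} \\ \mathrm{CP}^r & \text{if } I_n = 1 \text{ and } X_m(k) = \mathrm{CM} \text{ for all } m \in \mathcal{P}(n) \text{ and } m \in \mathcal{V} \setminus \mathcal{V}_d \\ \mathrm{ID} & \text{otherwise}. \end{cases} \qquad (28)$$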
Fig. 8. Energy and latency trade-off for call graph $\mathcal{G}$ in Fig. 1 and the subtrees
7i and 72 with vi = 0, V 2 = V 4 = vi 2 = 0.6 x 10^, vs = 0.24 x 10^, 

Vs = 0.4 X 10^, Vq = Vg = Vi 4 = 2 X 10^, Vy = Vs = 1.1 X 10^, 

viQ = 0.66 X 10®, wii = ^13 = 1 X 10®, Vis = 0.2 x 10® CPU cycles, 

^1,2 = ^3,5 = ^3,6 = ^5,10 = ^9,12 = ^11,14 = ^12,14 = 5 X 10®, 

62,3 = 15 x 10 ®, 62,4 = 9 . 7 x 10 ®, 64,7 = ^4 8 = 8 . 5 x 10 ®, 64,9 = 3 x 10 ®, 
be,10 = Sx 10 ®, 67,11 = 68,11 = l -'2 X 10 ^, 610,13 = 613,15 = 10 X 10 ® 
and 614,15 = 15.5 x 10 ® bits. 

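Based on the discussion above, the values $N^{ul}(k)$,
$N^l(k)$, $N^r(k)$ and $N^{dl}(k)$ are calculated at each
time $t_k$ according to the states of nodes as $N^{ul}(k) =
\sum_{n=1}^{|\mathcal{V}|} 1(X_n(k) = \mathrm{UL})$, $N^l(k) = \sum_{n=1}^{|\mathcal{V}|} 1(X_n(k) = \mathrm{CP}^l)$,
$N^r(k) = \sum_{n=1}^{|\mathcal{V}|} 1(X_n(k) = \mathrm{CP}^r)$ and $N^{dl}(k) =
\sum_{n=1}^{|\mathcal{V}|} \sum_{m \in \mathcal{P}(n)} 1(X_n(k) = \mathrm{DL}$ and $b^{dl}_{m,n}(k) >
0$ and $X_m(k) = \mathrm{CM})$, where $1(\cdot)$ is the indicator function.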

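Finally, at the end of each time interval $(t_k, t_{k+1}]$ the energy
consumed by the mobile is updated as

$$\begin{aligned} E(k+1) = E(k) &+ \sum_{n \in \mathcal{V}} \sum_{m \in \mathcal{P}(n)} 1\big(X_n(k) = \mathrm{DL} \text{ and } b^{dl}_{m,n}(k) > 0 \text{ and } X_m(k) = \mathrm{CM}\big)\big(P^{dl} + P^{rx}\big)\epsilon_d \\ &+ \sum_{n \in \mathcal{V}} 1\big(X_n(k) = \mathrm{UL}\big)\big(P_{n,c(n)} + P^{tx}\big)\epsilon_d \\ &+ \sum_{n \in \mathcal{V}} 1\big(X_n(k) = \mathrm{CP}^l\big)\frac{P^l}{N^l(k)}\epsilon_d. \end{aligned} \qquad (29)$$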

The latency is instead given by the smallest value $t_k$ such
that $X_{|\mathcal{V}|}(k) = \mathrm{CM}$ for the root node $T_{|\mathcal{V}|}$. We observe
that (29) assumes that transmissions and computations last
for the entire period of duration $\epsilon_d$ even if the task is completed at
some time within the interval. This implies that (29) and the
corresponding latency are upper bounds on the actual energy
and latency that become increasingly tight as $\epsilon_d$ becomes
smaller.
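The origin of the upper bound can be illustrated with a short sketch. It assumes, as in the slotted model above, that an operation occupies every slot it touches in full; the function name `charged_time` and the numerical values are illustrative.

```python
import math

def charged_time(true_duration, eps_d):
    """Duration billed by the slotted model: whole slots of length eps_d until completion."""
    return math.ceil(true_duration / eps_d) * eps_d

# Overshoot of the billed duration for a 1.3 s operation and shrinking slot lengths
TRUE_DURATION_S = 1.3
overshoot = {eps: charged_time(TRUE_DURATION_S, eps) - TRUE_DURATION_S
             for eps in (0.5, 0.25, 0.125)}
```

The billed duration is never smaller than the true one and exceeds it by less than one slot, so the gap between the bound and the actual value is controlled by the step size.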